Watch and Learn: Semi-Supervised Learning of Object Detectors from Videos
We present a semi-supervised approach that localizes multiple unknown object
instances in long videos. We start with a handful of labeled boxes and
iteratively learn and label hundreds of thousands of object instances. We
propose criteria for reliable object detection and tracking for constraining
the semi-supervised learning process and minimizing semantic drift. Our
approach does not assume exhaustive labeling of each object instance in any
single frame, or any explicit annotation of negative data. Working in such a
generic setting allows us to tackle multiple object instances in video, many of
which are static. In contrast, existing approaches either do not consider
multiple object instances per video, or rely heavily on the motion of the
objects present. The experiments demonstrate the effectiveness of our approach
by evaluating the automatically labeled data on a variety of metrics like
quality, coverage (recall), diversity, and relevance to training an object
detector. Comment: To appear in CVPR 2015.
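The iterative labeling loop described above can be summarized in a short sketch. The helper callables (train_detector, detect_and_track, is_reliable) are hypothetical placeholders standing in for the paper's detector training, tracking, and reliability criteria; they are not the authors' implementation.

```python
def grow_labels(train_detector, detect_and_track, is_reliable,
                detector, videos, seed_boxes, n_rounds=5):
    """Minimal sketch of the iterative semi-supervised loop: start from a
    handful of labeled boxes, alternate between training the detector and
    automatically labeling new instances, and only keep candidates that
    pass reliability criteria to limit semantic drift. All helper
    callables are hypothetical placeholders."""
    labeled = list(seed_boxes)
    for _ in range(n_rounds):
        detector = train_detector(detector, labeled)
        for video in videos:
            for candidate in detect_and_track(detector, video):
                if is_reliable(candidate):   # confident detection + consistent track
                    labeled.append(candidate)
    return detector, labeled
```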
Cross-stitch Networks for Multi-task Learning
Multi-task learning in Convolutional Networks has displayed remarkable
success in the field of recognition. This success can be largely attributed to
learning shared representations from multiple supervisory tasks. However,
existing multi-task approaches rely on enumerating multiple network
architectures specific to the tasks at hand, which do not generalize. In this
paper, we propose a principled approach to learn shared representations in
ConvNets using multi-task learning. Specifically, we propose a new sharing
unit: "cross-stitch" unit. These units combine the activations from multiple
networks and can be trained end-to-end. A network with cross-stitch units can
learn an optimal combination of shared and task-specific representations. Our
proposed method generalizes across multiple tasks and shows dramatically
improved performance over baseline methods for categories with few training
examples. Comment: To appear in CVPR 2016 (Spotlight).
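As a rough illustration, a cross-stitch unit can be written as a learned 2x2 mixing of the activations of two task-specific networks at a given layer. The sketch below is a minimal PyTorch version with scalar mixing weights and an identity-leaning initialization; both choices are assumptions for illustration rather than details taken from the abstract.

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Minimal sketch of a cross-stitch unit: a learned 2x2 linear
    combination of the activations of two task networks. Initializing
    close to the identity keeps the networks mostly task-specific at
    the start of training (an assumption here, not a prescription)."""

    def __init__(self, init_shared=0.1):
        super().__init__()
        init_self = 1.0 - init_shared
        # alpha[i][j] weights how much of network j's activation
        # flows into network i's next layer.
        self.alpha = nn.Parameter(torch.tensor([[init_self, init_shared],
                                                [init_shared, init_self]]))

    def forward(self, x_a, x_b):
        # x_a, x_b: activations of the same layer in the two networks.
        out_a = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
        out_b = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
        return out_a, out_b
```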
Evaluating Text-to-Image Matching using Binary Image Selection (BISON)
Providing systems the ability to relate linguistic and visual content is one
of the hallmarks of computer vision. Tasks such as text-based image retrieval
and image captioning were designed to test this ability but come with
evaluation measures that have a high variance or are difficult to interpret. We
study an alternative task for systems that match text and images: given a text
query, the system is asked to select the image that best matches the query from
a pair of semantically similar images. The system's accuracy on this Binary
Image SelectiON (BISON) task is interpretable, eliminates the reliability
problems of retrieval evaluations, and focuses on the system's ability to
understand fine-grained visual structure. We gather a BISON dataset that
complements the COCO dataset and use it to evaluate modern text-based image
retrieval and image captioning systems. Our results provide novel insights into
the performance of these systems. The COCO-BISON dataset and corresponding
evaluation code are publicly available from \url{http://hexianghu.com/bison/}.
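A minimal sketch of the evaluation protocol, assuming each example is a (query, true image, distractor image) triple and `score` is any text-image matching function; this is illustrative only, not the released COCO-BISON evaluation code.

```python
def bison_accuracy(examples, score):
    """Minimal sketch of BISON evaluation: the system is counted as
    correct when it assigns the higher matching score to the image the
    query actually describes. `examples` and `score` are assumed
    interfaces for illustration."""
    correct = 0
    for query, true_img, distractor_img in examples:
        if score(query, true_img) > score(query, distractor_img):
            correct += 1
    return correct / max(len(examples), 1)
```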
A Simple Recipe for Competitive Low-compute Self-supervised Vision Models
Self-supervised methods in vision have mostly focused on large
architectures, as they seem to suffer a significant performance drop with
smaller architectures. In this paper, we propose a simple self-supervised
distillation technique that can train high performance low-compute neural
networks. Our main insight is that existing joint-embedding based SSL methods
can be repurposed for knowledge distillation from a large self-supervised
teacher to a small student model. Thus, we call our method Replace one Branch
(RoB) as it simply replaces one branch of the joint-embedding training with a
large teacher model. RoB is widely applicable to a number of architectures such
as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV
or iBOT. When pretraining on the ImageNet dataset, RoB yields models that
compete with supervised knowledge distillation. When applied to MSN, RoB
produces students with strong semi-supervised capabilities. Finally, our best
ViT-Tiny models improve over the prior SSL state-of-the-art on ImageNet
and are on par with or better than a supervised distilled DeiT on five downstream
transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We
hope RoB enables practical self-supervision at smaller scales.
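A minimal sketch of the Replace one Branch idea: one branch of a joint-embedding SSL setup is replaced with a frozen, pretrained teacher, and the small student is trained to match it. The two-view setup and the cosine-similarity loss are assumptions for illustration; the actual objective follows whichever SSL method (DINO, SwAV, iBOT, MSN) is being repurposed.

```python
import torch
import torch.nn.functional as F

def rob_distillation_step(student, teacher, view_s, view_t, optimizer):
    """Minimal sketch of a RoB-style training step: the teacher branch is
    frozen and the student is pulled toward the teacher's embedding.
    Loss choice and two-view setup are illustrative assumptions."""
    with torch.no_grad():              # frozen self-supervised teacher branch
        target = teacher(view_t)
    pred = student(view_s)             # small, low-compute student branch
    loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```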
MonoNeRF: Learning Generalizable NeRFs from Monocular Videos without Camera Pose
We propose MonoNeRF, a generalizable neural radiance field that can be
trained on large-scale monocular videos captured while moving through static scenes, without any
ground-truth annotations of depth and camera poses. MonoNeRF follows an
Autoencoder-based architecture: the encoder estimates the monocular depth
and the camera pose, and the decoder constructs a Multiplane NeRF
representation from the encoder's depth features and renders the input frames
with the estimated camera. The learning is supervised by the reconstruction
error. Once the model is learned, it can be applied to multiple applications
including depth estimation, camera pose estimation, and single-image novel view
synthesis. More qualitative results are available at:
https://oasisyang.github.io/mononerf. Comment: ICML 2023 camera-ready version. Project page:
https://oasisyang.github.io/mononerf
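A minimal sketch of the training step implied by the abstract, with assumed encoder/decoder interfaces: the encoder predicts depth, pose, and features; the decoder re-renders the input frames with the estimated camera; and an L1 photometric reconstruction error (an assumption here) provides the only supervision.

```python
import torch.nn.functional as F

def mononerf_train_step(encoder, decoder, frames, optimizer):
    """Minimal sketch of self-supervised training by reconstruction.
    The encoder/decoder signatures are assumptions for illustration,
    not the released MonoNeRF model."""
    depth, pose, features = encoder(frames)        # monocular depth + camera pose
    rendered = decoder(features, depth, pose)      # multiplane rendering with estimated camera
    loss = F.l1_loss(rendered, frames)             # photometric reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```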
Generating Natural Questions About an Image
There has been an explosion of work in the vision & language community during
the past few years, from image captioning and video transcription to answering
questions about images. These tasks have focused on literal descriptions of the
image. To move beyond the literal, we choose to explore how questions about an
image are often directed at commonsense inference and the abstract events
evoked by objects in the image. In this paper, we introduce the novel task of
Visual Question Generation (VQG), where the system is tasked with asking a
natural and engaging question when shown an image. We provide three datasets
which cover a variety of images from object-centric to event-centric, with
considerably more abstract training data than provided to state-of-the-art
captioning systems thus far. We train and test several generative and retrieval
models to tackle the task of VQG. Evaluation results show that while such
models ask reasonable questions for a variety of images, there is still a wide
gap with human performance which motivates further work on connecting images
with commonsense knowledge and pragmatics. Our proposed task offers a new
challenge to the community, which we hope furthers interest in exploring deeper
connections between vision & language. Comment: Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics.
Learning by Asking Questions
We introduce an interactive learning framework for the development and
testing of intelligent visual systems, called learning-by-asking (LBA). We
explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs
from standard VQA training in that most questions are not observed during
training time, and the learner must ask questions it wants answers to. Thus,
LBA more closely mimics natural learning and has the potential to be more
data-efficient than the traditional VQA setting. We present a model that
performs LBA on the CLEVR dataset, and show that it automatically discovers an
easy-to-hard curriculum when learning interactively from an oracle. Our
LBA-generated data consistently matches or outperforms the CLEVR train data and is
more sample-efficient. We also show that our model asks questions that
generalize to state-of-the-art VQA models and to novel test-time distributions.
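A minimal sketch of the interactive loop, with hypothetical `propose_question`, `oracle`, and `model.fit` interfaces: the learner chooses what to ask, the oracle answers, and the model is retrained on the growing question-answer set.

```python
def learning_by_asking(model, propose_question, oracle, images, n_rounds=10):
    """Minimal sketch of the learning-by-asking (LBA) loop: the learner
    generates the questions it wants answered, an oracle answers them,
    and the model is retrained on the collected data. All interfaces
    here are illustrative assumptions, not the paper's implementation."""
    qa_pairs = []
    for _ in range(n_rounds):
        for image in images:
            question = propose_question(model, image)  # learner decides what to ask
            answer = oracle(image, question)           # oracle provides supervision
            qa_pairs.append((image, question, answer))
        model.fit(qa_pairs)                            # retrain on the growing set
    return model
```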